arXiv:l506.08435v3 [cs.NA] 9 Apr 2016 


Large-scale optimization-based non-negative 

computational framework for diffusion equations: 
Parallel implementation and performance studies 


An e-print of the paper is available on arXiv: 1506.08435. 

Authored by 

J. Chang 

Graduate Student, University of Houston. 

S. Karra 

Staff Scientist, Los Alamos National Laboratory. 

K. B. Nakshatrala 

Department of Civil & Environmental Engineering 
University of Houston, Texas 77204-4003. 
phone: +1-713-743-4413, e-mail: knakshatrala@uh.edu 
website: http://www.cive.uh.edu/faculty/nak.shatrala 


0.000 0.052 0.103 0.15 5 0.20 7 

Max: 0.207 
Min: -0.012 



time = 180 days 


This figure shows the fate of chromium after 180 days using the single-field Galerkin 
formulation. The white regions indicate the violation of the non-negative constraint. 


2015 

Computational & Applied Mechanics Laboratory 













Large-scale Optimization-based Non-negative Computational 
Framework for Diffusion Equations: Parallel Implementation and 

Performance Studies 


J. Chang, S. Karra and K. B. Nakshatrala 

Correspondence to: e-mail: knakshatrala@uh.edu, ph.one;+l-713-743-4418 

Abstract. It is well-known that the standard Galerkin formulation, which is often the formulation 
of choice under the finite element method for solving self-adjoint diffusion equations, does not meet 
maximum principles and the non-negative constraint for anisotropic diffusion equations. Recently, 
optimization-based methodologies that satisfy maximum principles and the non-negative constraint for 
steady-state and transient diffusion-type equations have been proposed. To date, these methodologies 
have been tested only on small-scale academic problems. The purpose of this paper is to systematically 
study the performance of the non-negative methodology in the context of high performance computing 
(HPC). PETSc and TAO libraries are, respectively, used for the parallel environment and optimization 
solvers. For large-scale problems, it is important for computational scientists to understand the computa¬ 
tional performance of current algorithms available in these scientihc libraries. The numerical experiments 
are conducted on the state-of-the-art HPC systems, and a single-core performance model is used to better 
characterize the efficiency of the solvers. Our studies indicate that the proposed non-negative computa¬ 
tional framework for diffusion-type equations exhibits excellent strong scaling for real-world large-scale 
problems. 


1. INTRODUCTION 

The modeling of flow and transport in subsurface is vital for energy, climate and environmental ap¬ 
plications. Examples include CO 2 migration in carbon-dioxide sequestration, enhanced geothermal sys¬ 
tems, oil and gas production, radio-nuclide transport in a nuclear waste repository, groundwater contam¬ 
ination, and thermo-hydrology in the Arctic permafrost due to the recent climate change [10,15,17,23]. 
Several numerical codes (e.g., FEHM [40], TOUGH [37], PELOTRAN [22]) have been developed to 
model flow and transport in subsurface at reservoir-scale. These codes typically solve unsteady Darcy 
equations for flow and advection-diffusion equation for transport. The predictive capability of a nu¬ 
merical simulator depends on the robustness of the underlying numerical methods. A necessary and 
essential requirement is to satisfy important mathematical principles and physical constraints. One such 
property in transport and reactive-transport problems is that the concentration of a chemical species 
cannot be negative. Mathematically, this translates to the satisfaction of the discrete maximum princi¬ 
ple (DMP) for diffusion-type equations. Subsurface flow and transport applications typically encounter 
geological media that are highly heterogeneous and anisotropic in nature, and it is well-known that 

Key words and phrases, high performance computing; anisotropic diffusion; maximum principles; non-negative con¬ 
straint; large-scale optimization. 
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the classical finite element (or finite volume and finite difference, for that matter) formulations do not 
produce non-negative solutions on arbitrary meshes for such porous media [6,24,32,35]. 

Several studies over the years have focused on the development of methodologies that enforce the 
DMP and ensure non-negative solutions [32,33,35,36]. However, these studies did not address how 
these methods can be used for realistic large-scale subsurface problems that have millions of grid nodes. 
Furthermore, complex coupling between different physical processes as well as the presence of multiple 
species amplify the degrees-of-freedom (i.e., the number of unknowns). The aim of this paper is to de¬ 
velop a parallel computational framework that solves anisotropic diffusion equations on general meshes, 
ensures non-negative solutions, and can be employed to solve large-scale realistic problems. 

Large-scale problems can be tackled by using recent advancements in high-performance computing 
(HPC) methods and toolkits that can be used on the state-of-the-art supercomputing architecture. One 
such toolkit is PETSc [3], which provides data structures and subroutines for setting up structured 
and unstructured grids, parallel communication, linear and non-linear solvers, and parallel I/O. These 
high-level data structures and subroutines help in faster development of parallel application codes and 
minimize the need to program low-level message passing, so that the domain scientists can focus more 
on the application. To this end, we develop a non-negative parallel framework by leveraging the existing 
capabilities within PETSc. Our framework ensures the DMP for anisotropic diffusion by using lower- 
order finite elements and the optimization-based approach in [24,32,35]. The TAO toolkit [31], which 
is built on top of PETSc, is used for solving the resulting optimization problems. The robustness of the 
proposed framework will be demonstrated by solving realistic large-scale problems. 

The rest of this paper is organized as follows. In Section 2, we present the governing equations 
and the classical single-field Galerkin finite element formulation for steady-state and transient diffusion 
equations. The optimization-based method to ensure non-negative concentrations is also outlined in this 
section. In Section 3, the parallel implementation procedure using PETSc and TAO is presented. We also 
highlight the relevant data structures used in this study and present a pseudo algorithm describing our 
parallel framework. In Section 4, a performance model loosely based on the roofline model is outlined. 
This model is used to estimated the efficiency with respect to computing hardware of currently available 
solvers within PETSc and TAO. In Section 5, we first verify our implementation using a 3D benchmark 
problem from the literature and present a detailed performance study using the proposed model. Then, 
we study a large-scale three-dimensional realistic problem involving the transport of chromium in the 
subsurface and document the numerical results of the non-negative methodology with the classical 
single-field Galerkin formulation. Conclusions are drawn in Section 6. 


2. GOVERNING EQUATIONS AND ASSOCIATED NON-NEGATIVE NUMERICAL 

METHODOLOGIES 

Let D C be a bounded open domain, where “nd” is the number of spatial dimensions. The 
boundary of the domain is denoted by clD = D — D, which is assumed to be piecewise smooth. A spatial 
point is denoted by x G D. The gradient and divergence operators with respect to x are, respectively, 
denoted as grad[-] and div[-]. As usual, the boundary is divided into two parts: P^ and P^. P^ is that 
part of the boundary on which Dirichlet boundary conditions are prescribed, and P^ is the part of the 
boundary on which Neumann boundary conditions are prescribed. For mathematical well-posedness, 
we assume P'^ U P^ = dQ and P'^ H P^ = 0. The unit outward normal to boundary is denoted as 
n(x). The diffusivity tensor is denoted by D(x), which is assumed to symmetric, bounded above and 
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uniformly elliptic. That is, 


D(x) = d'^(x) Vx g n 


( 2 . 1 ) 


and there exists two constants 0 < < ^2 < +00 such that 

< y^D(x)y < ^y'^'y Vx e ll and Vy G M”"* (2.2) 


2.1. Governing equations for steady-state response. We shall denote the steady-state con¬ 
centration field by c(x). The governing equations can be written as follows: 


div[D(x)grad[c]] = /(x) 

in H 

(2.3a) 

c(x) = cP(x) 

on F^ 

(2.3b) 

n(x) • D(x)grad[c] = g''’(x) 

on F^ 

(2.3c) 


where /(x) is the volumetric source/sink, cP(x) is the prescribed concentration, and ^^(x) is the pre¬ 
scribed flux. For uniqueness, we assume T^ 7 ^ 0. 

2.1.1. Maximum principle and the non-negative constraint. The above boundary value problem is 
a self-adjoint second-order elliptic partial differential equation (PDF). It is well-known that such PDFs 
possess an important mathematical property - the classical maximum principle [8]. The mathematical 
statement of the classical maximum principle can be written as follows: If c(x) G C‘^{Q) H C'^(n), 
dit = T^, and /(x) < 0 in 11 then 

maxc(x) = max c’’(x) (2.4) 

xen xean 

Similarly, if /(x) > 0 in 11 then 

minc(x) = min cP(x) (2-5) 

xen xe90 

To make our presentation on maximum principles simple, we have assumed stronger regularity on the 
solution (i.e., c(x) G C'^nC'^(ll)), and assumed that Dirichlet boundary conditions are prescribed on the 
entire boundary. However, maximum principles requiring milder regularity conditions on the solution, 
even for the case when Neumann boundary conditions are prescribed on the boundary, can be found in 
literature (see [29,30]). 

If/(x) > 0 in n and cP(x) > 0 on the entire till then the maximum principle implies that c(x) > 0 
in the entire domain, which is the non-negativity of the concentration field. 

2 . 1 . 2 . Single-field Galerkin weak formulation. The following function spaces will be used in the rest 
of this paper: 

U := {c(x) G if^(ll) I c(x) = cP(x) on T^} (2.6) 

W := {rc(x) G Ff^(ll) | w{-x.) = 0 on P^} (2.7) 

where 77^(11) is a standard Sobolev space [1]. The single-field Galerkin weak formulation corresponding 
to equations (2.3a)-(2.3c) reads: Find c(x) G U such that we have 

B{w; c) = L{w) Vu;(x) G W 
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( 2 . 8 ) 


where the bilinear form and linear functional are, respectively, defined as 


B{w,c) := / grad['u;(x)] • D(x)grad[c(x)] dn (2.9a) 

Jn 

L{w) := [ t(;(x)/(x) dn + [ rt;(x)gP(x) dP (2.9b) 

Jn JrN 

Since D(x) is symmetric, by Vainberg’s theorem [13], the single-field Galerkin weak formulation given 
by equation (2.8) is equivalent to the following variational problem: 

minimize -B{c',c) — L{c) (2-10) 

c(x)ew 2 

2.1.3. A methodology to enforce the maximum principle for steady-state problems. Our methodology 
is based on the finite element method. We decompose the domain into ‘We/e” non-overlapping open 
element sub-domains such that 

Nele 

n = y n" (2.11) 

e=l 

(Recall that a superposed bar denotes the set closure.) The boundary of is denoted by := kl®. 
We shall define the following finite dimensional vector spaces of U and W: 

:= |c'*(x) gU I c'*(x) E 0°(n), c^(x)|^, E P'=(0^),e = 1, • • • ,Nele^ (2.12a) 

W'* ;= |?u'^(x) E W I E C°(n),u;'^(x)|^, GF^{Q^),e = 1, • • • ,Nele^ (2.12b) 

where k is a non-negative integer, and P^(fl®) denotes the linear vector space spanned by polynomials 
up to k-th order defined on the sub-domain OF. The finite element formulation for equation (2.8) can 
be written as: Find c^(x) E such that we have 

i3(g^c'*) = L(/) Vg^(x) E (2.13) 

It has been documented in the literature that the above finite element formulation violates the maximum 
principle and the non-negative constraint [24,32,35]. 

We now outline an optimization-based methodology that satisfies the maximum principle and the 
non-negative constraint on general computational grids. To this end, we shall use the symbols ^ and 
^ to denote component-wise inequalities for vectors. That is, for given any two vectors a and b 

a <b means that a* < 6* Mi (2-14) 

The symbol ^ can be similarly defined as well. Let < •;• > denote the standard inner-product in 
Euclidean space. After finite element discretization, the discrete equations corresponding to equation 
(2.13) take the form 

Kc = f (2.15) 

where IT is a symmetric positive definite matrix, c is the vector containing nodal concentrations, and 
/ is the force vector. Equation (2.15) is equivalent to the following minimization problem 

minimize - (c;iTc)-(c;/) (2.16) 

where “ndo/s” denotes the number of degrees of freedom for concentration. Equation (2.15) can lead 
to unphysical negative solutions. 
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Following [24,32], a methodology corresponding to equation (2.16) that satishes the non-negative 
constraint can be written as follows: 

minimize - < c; Kc > — < c] f > (2.17a) 

subject to 0 ^ c (2.17b) 

where 0 is a vector of size ndofs containing zeros. Since K is positive definite, equation (2.17) has a 
unique global minimum [4]. Several robust numerical methods can be used to solve equation (2.17), 
which include active set strategy, interior point methods [4]. In this paper, to solve the resulting 
optimization problems, we shall use the parallel optimization toolkit TAO [31], which has the active- 
set Newton trust region (TRON) and quasi-Newton-based bounded limited memory variable metric 
(BLMVM) algorithms. 


2.2. Governing equations for transient response. We shall denote the time by t G [0,X], where 
X denotes the length of the time interval of interest. We shall denote the time-dependent concentration 
by c(x, t). The initial boundary value problem can be written as follows: 


dc 

— = div[D(x)grad[c]] /(x,t) 

in n X (0,X) 

(2.18a) 

c(x, t) = cP(x, t) 

on r° X (0,X) 

(2.18b) 

n(x) • D(x)grad[c] = q^{x,t) 

on X (0,X) 

(2.18c) 

c(x,0) = co(x) 

in n 

(2.18d) 


where co(x) is the prescribed initial concentration, /(x,t) is the time-dependent volumetric source/sink, 
cP(x, t) is the time-dependent prescribed concentration on the boundary, and ^^(x, t) is the prescribed 
time-dependent flux on the boundary. 

2.2.1. Maximum principle and the non-negative constraint. The maximum principle of a transient 
diffusion equation asserts that the maximum can occur only on the boundary of the domain or in the 
initial condition if /(x, t) < 0 and T^ = cAl. Mathematically, a solution to equations (2.18a)-(2.18a) 
will satisfy: 


c(x, t) < max 


maxco(x), max c^fx, t) 
xefi ^ xean ' ^ 


Vt 


(2.19) 


provided /(x, t) < 0. Similarly, the minimum will occur either on the boundary or in the initial condition 
if f{x,t) > 0. That is, if f{x,t) > 0 then a solution to equations (2.18a)-(2.18a) satisfies: 


c(x,f) > min 


mincn(x), min c^(x,t) 
xen xean 


Vf 


( 2 . 20 ) 


If f{x,t) > 0 in n, cP(x, f) > 0 on the entire dit, and co(x) > 0 in then the maximum principle 
implies that c(x, t) > 0 in the entire domain at all times, which is the non-negative constraint for the 
concentration field for transient problems. 

2 .2.2. A methodology to enforce the maximum principle for transient problems. We divide the time 
interval of interest into M sub-intervals. That is, 

JV 

[0,X]:=]J[tn,tn+l] (2.21) 

n=0 
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where denotes the n-th time-level. We assume that the time-step is uniform, which can be written 
as: 

At = tn+l - tn (2.22) 

Following the recommendation provided in [34] to meet maximum principles, we employ the backward 
Euler method for temporal discretization. We shall denote the nodal concentrations at the n-th time- 
level by We shall denote the minimum and maximum values for the concentration by Cmin and Cmax; 
which will be provided by the maximum principle and the non-negative constraint. At each time-level, 
one has to solve the following convex quadratic program: 

minimize (2.23a) 

c("+i) 2 

subject to Cminl A ^ Cmaxl (2.23b) 

where 

K := -^M + K 
At 

~(n+l) «(„+!) 

At 

In the above equation, M is the capacity matrix [34]. 

3. PARALLEL IMPLEMENTATION 

3.1. PETSc and TAO. We leverage on existing scientific libraries such as PETSc and TAO to 
formulate our large-scale computational framework. PETSc is a suite of data structures and routines 
for the parallel solution of scientific applications. It also provides interfaces to several other libraries 
such as Metis/ParMETIS [16] and HDE5 [38] for mesh partitioning and binary data format handling 
respectively. The Data Management (DM) data structure is used to manage all information including 
vectors and sparse matrices and compatible with binary data formats. To handle unstructured grids in 
parallel, a subset of the DM structure called DMPlex (see [3,20,21]), as shown in Figure 1, uses the 
direct acyclic graph to organize all mesh information. This topology enables the freedom to mix and 
match various non vertex-based discretization such as the two-point flux finite volume method and the 
classical mixed formulations based on the lowest-order Raviart Thomas finite element space. 

Another important feature within PETSc is TAO. The TAO library has a suite of data structures and 
routines that enable the solution of large-scale optimization problems. It can support any data structure 
or solver within PETSc. Our non-negative methodology will use both the Newton-Trust Region (TRON) 
and Bounded Limited-Memory Variable-Metric (BLMVM) solvers available within TAO. BLMVM is 
a quasi-Newton method that uses projected gradients to approximate the Hessian, which is useful for 
problems where the hessian is too complicated or expensive to compute. Other optimization algorithms 
such as TRON and the Gradient Projected Conjugate Gradient (GPCG) typically require Hessian 
information and more memory, but they are expected to converge more rapidly than BLMVM. Eurther 
details regarding the implementation of these various methods may be found in [31] and the references 
within. 

3.2. Finite element implementation. PETSc abstractions for finite elements, quadrature rules, 
and function spaces have also been recently introduced and are suitable for the mesh topology within 
DMPlex. They are built upon the same framework as the Finite element Automatic Tabulator (FIAT) 
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depth 0 


depth 1 



2D and 3D elements 



Figure 1. Representation of mesh points within the DMPlex data structure and their 
associated directed acyclic graphs 


found within the FEniCS Project [18,26,27], The finite element discretizations simply need the equa¬ 
tions, auxiliary coefficients (e.g., permeability, diffusivity, etc.), and boundary conditions specified as 
point-wise functions. We express all discretizations in nonlinear form so let r and J denote the residual 
and Jacobian respectively. 
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Following the FEM model outlined in [19], we consider the weak form that depends on fields and 
gradients. The residual evaluation can be expressed as: 


m'^r(c) ~ / [u) • To (c, Vc) + Vtc • .^i(c, Vc)] dn = 0 (3-1) 

where To(c, Vc) and Ti{c, Vc) are user-defined point-wise functions that capture the problem physics. 
This framework decouples the problem specification from the mesh and degree of freedom traversal. 
That is, the scientist need only focus on providing point function evaluations while letting the finite 
element library take care of meshing, quadrature points, basis function evaluation, and mixed forms if 
any. The discretization of the residual is written as: 


Nele 

r{c) = A [ ]W 


•^o(Cg, ^Cq) 

^ 1 (Cg, VCg) 


(3.2) 


where A represents the standard assembly operator, Af and B are matrix forms of basis functions that 
reduce over quadrature points, W is a diagonal matrix of quadrature weights (including the geometric 
Jacobian determinant of the element), and Cg is the field value at quadrature point q. The Jacobian of 
(3.2) needs only the derivatives of the point-wise functions: 


J(c) = 


Nele 

A[ 


B' 


e=l 


w 


-^ 0,1 

^1,1 


AT 

B 





dTo 

dc 

5Vc 

d:Fi 

dTi 

dc 

9Vc 


{Cq, VCg 


The point-wise functions corresponding to the weak form in (2.9) would be: 


To = -/(x), Ti = D(x)Vcg 

To,o = 0, Tb,i = 0, Tgo = 0, Tip = D(x) 


(3.3) 


(3.4a) 

(3.4b) 


Similarly, the point-wise functions for the transient response are: 


-^0 = Cg -/(x,t), Ti = D(x)Vcg (3.5a) 

-^0,0 = = 0, Ti,o = 0, Tip = D(x) (3.5b) 

where Cg denotes the time derivative. A similar discretization is used to project the Neumann boundary 
conditions into the residual vector. Assuming a fixed time-step, [Tp] and the Jacobian in equation 
(3.3) do not change with time and have to be computed only once. If n denotes the time level (n = 0 

denotes initial condition) then the residual and Jacobian can be defined as: 

rH = r(cW) (3,6) 

J = J(c(°)) (3.7) 


To enforce the non-negative methodology, the following objective function b and gradient function g is 
provided: 




g = J 


^(n+l) _ ^(n) 


M) 


(3.8) 

(3.9) 


BLMVM relies only on the above two equations, whereas TRON needs the Hessian which is equivalent 
to J. Algorithm 1 outlines the steps taken in our computational framework. 














4. PERFORMANCE MODELING 


PETSc is a constantly evolving open-source library that brings out new features and algorithms 
almost every day. It has capabilities to interface with a large number of other open-source software and 
linear algebra packages. However, it is not always known which of these algorithms will have the best 
performance across multiple distributed memory HPC systems, especially if these packages have little 
documentation and have to be used as black-box solvers. Computational scientists would like to know 
which solvers or algorithms to use for their specific need before running jobs on the state-of-the-art HPC 
systems. The first and the trivial metric to look for in answering this question is the time-to-solution for 
a given solver or optimization method. However, additional information is needed in order to quantify 
the hardware and algorithmic efficiency as well as the potential scalability across multiple cores in the 
strong sense. 

Hardware specifications of HPC systems significantly impact the performance of any numerical 
algorithm. Ideally we want our simulations to consume as little wall-clock time as possible as the 
number of processing cores increases (i.e., achieving good speedup), but several other factors including 
compiler vectorization, cache locality, memory bandwidth, and code implementation may drastically 
affect the performance. Table 1 lists the hardware specifications of the two HPC systems (Mustang 
and Wolf) that are used in our numerical experiments. The Mustang HPC system consists of relatively 


Algorithm 1 Pseudocode for the large-scale transport solver 
Create/input DAG on rank 0 
Create/input cell-wise velocity on rank 0 
if size > 1 then 

Partition mesh among all processors 

end if 

Refine distributed mesh if necessary 
Create PetscSection and FE discretization 
Set n = 0 and = 10“® 

Insert Dirichlet BC constraints into 
Compute Jacobian J 

while true do i> Begin time-stepping scheme 

Compute Residual 

if Classical Galerkin then > Solve without non-negative methodology 

f.{n+l) _ ^(n) _ J\j{n) 

else t> Solve with non-negative methodology 

TaoSolve() for based on equations (3.8) and (3.9) 

end if 

if steady-state or (n) == total number of time steps then 
break 
else 

n+ = 1 

end if 
end while 
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Table 1. List of HPC systems used in this study 



Mustang (MU) 

Wolf (WF) 

Processor 

AMD Opteron 6176 

Intel Xeon E5-2670 

Clock rate 

2.3 GHz 

2.6 GHz 

FLOPs/cycle 

4 

8 

Sockets per compute node 

2 

2 

NUMA nodes per socket 

2 

1 

Cores per socket 

12 

8 

Total cores (compute nodes) 

38400 (1600) 

9856 (616) 

Memory per compute node 

64 GB 

64 GB 

LI cache per core 

128 KB 

32 KB 

L2 cache per core 

512 KB 

256 KB 

L3 cache per socket 

12 MB 

20 MB 

Interconnect 

40 Gb/s 

40 Gb/s 


older generation of processors so it is expected to not perform as well. One could simply measure wall- 
clock time across multiple compute nodes on the respective HPC machines and determine the parallel 
efficiency of a certain algorithm, but we are interested in quantifying how different algorithms behave 
sequentially and what kind of parallel performance to expect before running numerical simulations on 
supercomputers. The wall-clock time of any simulation can generally be summed up as a function of 
three things: the workload, transfer of data between the memory and CPU register, and interprocess 
communication. Hardware efficiency in this context is defined as the amount of time spent performing 
work over waiting on memory fetching and cache registers to free up. 

The limiting factor of performance for numerical methods on modern computing architectures is 
upper-bounded by the memory bandwidth. That is, the floating point performance given by FLOPS/s 
will never reach the theoretical peak performance (TPP). This limitation is particularly important 
for iterative solvers and optimization methods that rely on numerous sparse matrix-vector (SpMV) 
multiplications (see [28] and the references within). The frequent use of SpMV allows for little cache 
reuse and will result in a large number of very expensive cache misses. Such behavior is important 
to document when determining how efficient a scientific code is, so performance models such as the 
Roofline Model [25,39], which measures memory transfers, have been used to better quantify the 
efficiency with respect to the hardware. Performance models in general can help application developers 
identify bottlenecks and indicate which areas of the code can be further optimized. In other words, the 
code can be designed so that it maximizes the full benefits of the available computing resources. In the 
next section, we will demonstrate that such models can also be used to predict the parallel efficiency 
of various optimization solvers on the two very different LANL HPC systems. The key parameter for 
these performance models is the Arithmetic Intensity (AI) which is defined as: 


AI 


Total FLOPS 
Total Bytes Transferred 


(4.1) 


where the Total Bytes Transferred (TBT) metric denotes the amount of bandwidth needed for a given 
floating point operation. The AI serves as a multiplier to the actual memory bandwidth and creates a 
“roofline” for the estimation of ideal peak performance. A cache model is needed in order to properly 
define the TBT. 
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Table 2. Commonly used PETSc operations and their respective Total Bytes Trans¬ 
ferred. Here we note X, Y, Z as vectors with i = 1, • • • ,N entries, a is a scalar value, 
and nz denotes the total number of non-zeros. We assume that the sizes of integers and 
doubles are 4 and 8 bytes respectively. 


PETSc function 

Operation 

Total Bytes Transferred 

VecNorm() 


8 (iV + l) 

VecDot() 


8{2N + 1) 

VecCopyO 

Y 

8{2N) 

VecSet() 

Y{i) = a 

8{2N) 

VecScale() 

Y = a*Y 

8{2N) 

VecAXPYO 

Y = a*X + Y 

8{3N) 

VecAYPXO 

Y = X + a*Y 

8(3V) 

VecPointwiseMult () 

Z(i) = X(i)*Y(i) 

8(3V) 

MatMult() 

SpMV 

4(Y -|- nz) -t- 8{2N -|- nz) 


To this end, we propose a roofline-like performance model where the TBT assumes a “perfect 
cache” - each byte of the data needs to be fetched from DRAM only once. This assumption enables 
us to predict a slightly more realistic upper bound of the peak performance than by simply comparing 
to the TPP. Table 2 lists the key PETSc functions used for the solvers and their respective estimates 
of TBT based on the perfect cache assumption. The formula for SpMV follows the procedure outlined 
in [9]. We assume that the TBT formula for operations also involving a sparse matrix and vector like 
the incomplete lower-upper (ILU) factorization to be the same as MatMult(). Estimating the TBT 
for other important operations like the sparse matrix-matrix and triple matrix products (which are 
important for multi-grid methods) is an area of future work. In short, our AI formulation relies on the 
following four key assumptions: 

(i) All floating-point operations (add, multiply, square roots, etc.) are treated equally and equate to 
one FLOP count. 

(ii) There are no conflict misses. That is, each matrix and vector element is loaded into cache only 
once. 

(iii) Processor never waits on a memory reference. That is, any number of loads and stores are satisfied 
in a single cycle. 

(iv) Compilers are capable of storing scalar multipliers in the register only for pure streaming compu¬ 
tations. 


Therefore, the efficiency based on this new roofline-like performance model as: 


Efficiency (%) 


Measured FLOPS/s 

, r TPP 
in < 

\ AI X STREAMS 


X 100 


(4.2) 


where the numerator is reported by the PETSc program and the denominator is the ideal performance 
upper-bounded by both the TPP and the product of AI and STREAMS bandwidth. STREAMS Triad 
[14] is one of the most popular benchmarks for determining the achievable memory performance of a 
given machine. Figure 2 denotes the estimated memorybandwidth as a function of number of cores on 
a single Mustang and Wolf node. It is interesting to note that although the Wolf node has a greater 
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Figure 2. Estimated memory bandwidth of a single Mustang and Wolf compute node 
based on the STREAMS Triad Benchmark 

bandwidth, there is no performance gain past eight cores. This means that an optimal use of a Wolf 
compute node for memory-bandwidth bound algorithms would be eight cores, whereas one would still 
see some performance gains when using all 24 cores on a Mustang node. 

The performance model that uses equation (4.2) is a serial model so the STREAMS metric for the 
Mustang and Wolf systems are 5.65 GB/s and 15.5 GB/s respectively. R should be noted that this 
performance model does account for cache effects. That is, it does not quantify the useful bandwidth 
sustained for some level of cache. The true hardware and algorithmic efficiency is not be reflected by this 
model, so our aim is to show relative performance between select PETSc and TAG solvers. Comparing 
the AI and the measured FLOPS/s with the STREAMS bandwidth will give us a better understanding 
of how high-performing the PETSc and TAG solvers are for select problems. 

5. REPRESENTATIVE NUMERICAL RESULTS 

In this section, we compare the performance of our non-negative methodology using the TAG solver 
to that of the Galerkin formulation using the Krylov Subspace (KSP) solver. We examine the perfor¬ 
mance using two problems: 

(i) a unit cube with a hole under steady-state, and 

(ii) a transient Chromium transport problem. 

The diffusivity tensor is assumed to depend on the flow velocity through 

D(x) = (q:t||v|| -b Dm) I -b (q;l - Q:t) ^|^||^ (5.1) 

where ai, ot, and Dm denote the longitudinal dispersivity, transverse dispersivity and molecular 
diffusivity, respectively. We employ the conjugate gradient method and the block Jacobi/ILU(0) pre¬ 
conditioner for solving the linear system from the Galerkin formulation and employ TAG’s TRGN and 
BLMVM methods for the non-negative methodology. The relative convergence tolerances for both KSP 
and TAG solvers are set to 10“®, and At for the transient response in the Chromium problem is initially 
set to 0.2 days. For strong-scaling studies shown here, we used GpenMPI vl.6.5 for message passing and 
bound processes to cores while mapping by sockets. ParaView [2] and Visit [5] were used to generate 
all contour and mesh plots. 
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(a) Location of the hole 


(b) Mesh type A 




(c) Mesh type B 


(d) Mesh type C 


Figure 3. Cube with a hole: pictorial description and the associated unstructured grids. 

Remark 1. Throughout the paper, the non-negative methodology that we refer to, is in fact a 
discrete maximum prineiple preserving methodology, in that, along with the non-negative constraint we 
also enforce that the concentrations are less than or equal to 1. 

5.1. Anisotropic diffusion in a unit cube with a cubic hole. Let the computational domain 
be a unit cube with a cubic hole of size [4/9, 5/9] x [4/9, 5/9] x [4/9,5/9]. The concentration on the 
outer boundary is taken to be zero and the concentration on the interior boundary is taken to be unity. 
The volumetric source is taken as zero (i.e., /(x) = 0). The velocity vector field for this problem is 
chosen to be 

v(x) = ex + Gy + Gz (5.2) 


The diffusion parameters are set as: ai = 1, ot = 0.001, and Dm = 0. The pictorial description 
of the computational domain and the three mesh types composed of 4-node tetrahedrons are shown in 
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Figure 4. Cube with a hole: numerical solution for cases A1 (left), B2 (middle), and C3 
(right) using the Galerkin formulation (top row) and non-negative methodology (bottom 
row). 

Table 3. Cube with a hole: list of various mesh type and refinement level combinations used 


Case 

Mesh type 

Refinement level 

Tetrahedrons 

Vertices 

A1 

A 

1 

199,296 

36,378 

B1 

B 

1 

409,848 

75,427 

Cl 

C 

1 

793,824 

140,190 

A2 

A 

2 

1,594,368 

278,194 

B2 

B 

2 

3,278,784 

574,524 

C2 

C 

2 

6,350,592 

1,089,562 

A3 

A 

3 

12,754,994 

2,175,330 

B3 

B 

3 

26,230,272 

4,483,126 

C3 

C 

3 

50,804,736 

9,172,044 


Figure 3. We consider three unstructured mesh types with three levels of element-wise mesh refinement, 
giving us nine total case studies of increasing problem size as shown in Table 3. We ran a total of five 
different simulations for this study: 
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Table 4. Cube with a hole: minimum and maximum concentrations for each case 


Case 

Min. concentration 

Max. concentration 

% nodes violated 

A1 

-0.0224825 

1.00000 

9,518/36,378 ^ 26.2% 

B1 

-0.0139559 

1.00000 

32,247/43,180 ^ 42.8% 

Cl 

-0.0125979 

1.00000 

57,272/140,190 ^ 40.9% 

A2 

-0.0311518 

1.00103 

82,983/278,194 ^ 29.2% 

B2 

-0.0143857 

1.00000 

255,640/574,524 ^ 44.9% 

C2 

-0.0119539 

1.00972 

453,766/1,089,562 ^ 41.6% 

A3 

-0.0258559 

1.00646 

643,083/2,175,330 ^ 29.6% 

B3 

-0.0115908 

1.00192 

2,073,934/4,483126 ^ 46.3% 

C3 

-0.0096186 

1.00545 

4,932,551/9,172,044 ^ 53.8% 


• Galerkin with CG/block Jacobi 

• TRONl: with KSP tolerance of 10“^ 

• TRON2: with KSP tolerance of 10“^ 

• TRON3: with KSP tolerance of 10“^ 

• BLMVM 

Table 5. Cube with a hole: wall-clock times (seconds) on Mustang for each solver 


Case 

Galerkin 

TRONl 

TRON2 

TRON3 

BLMVM 

A1 

0.337 

0.933 

0.981 

1.14 

2.62 

B1 

0.790 

1.72 

2.06 

2.71 

5.04 

Cl 

2.24 

4.34 

5.80 

7.74 

13.5 

A2 

7.21 

15.2 

21.7 

32.5 

72.0 

B2 

15.4 

30.0 

43.7 

57.5 

109 

C2 

40.4 

67.8 

113 

118 

286 

A3 

121 

225 

414 

599 

1167 

B3 

315 

498 

1061 

1344 

2524 

C3 

997 

1539 

2490 

4365 

9679 


Table 6 . Cube with a hole: wall-clock times (seconds) on Wolf for each solver 


Case 

Galerkin 

TRONl 

TRON2 

TRON3 

BLMVM 

A1 

0.126 

0.388 

0.396 

0.449 

1.01 

B1 

0.314 

0.720 

0.853 

1.07 

2.03 

Cl 

0.888 

1.91 

2.47 

3.31 

5.71 

A2 

2.58 

6.34 

8.74 

12.8 

26.2 

B2 

5.90 

12.9 

17.8 

22.8 

46 

C2 

16.2 

30.1 

47.3 

48.9 

133 

A3 

48.0 

98.4 

129 

247 

609 

B3 

107 

171 

342 

435 

1060 

C3 

281 

467 

870 

1245 

3131 
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Figure 5. Cube with a hole: solver iterations needed for Galerkin and BLMVM. 




Figure 6 . Cube with a hole: KSP (left) and TAO (right) solver iterations needed for TRON. 


The TRON solvers also use the CG and block Jacobi preconditioner but with different KSP tolerances. 
Numerical results for both the Galerkin formulation and the non-negative methodologies for some of 
the mesh cases are shown in Figure 4. The top row of figures arise from the Galerkin formulation 
where the white regions denote negative concentrations, and the bottom row arise from either TRON 
or BLMVM. Details concerning the violation of the DMP for each case study can be fonnd in Table 4. 
Concentrations both negative and greater than one arise for all case studies. Moreover, simply rehning 
the mesh does not resolve these issues; in fact, rehnment worsens the violation. These numerical results 
indicate that our computational framework can successfully enforce the DMP for diffusion problems 
with highly anisotropic diffusivity. 

5.1.1. Performance modeling. We first consider the wall-clock time spent in the solvers on a single 
core. Table 5 and 6 depict the solver time for each mesh, and we first note that Mustang system requires 
significantly more wall-clock time to obtain a solution than Wolf; this behavior is expected due to the 
difference in HPC hardware specifications listed in Table 1, specifically, Mustang has the lower clock rate 
and lower bandwidth (as determined through STREAMS Triad). It can also be seen that the various 
non-negative solvers consume varying amounts of wall-clock time. BLMVM can require as much as ten 
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(b) Wolf 


Figure 7. Cube with a hole: measured floating-point rate (FLOPS/s) on a single core. 


times the amount of wall-clock time as the standard Galerkin method. TRON on the other hand, does 
not consume nearly as much time but tightening the KSP tolerances will gradually increase the amount 
of time. We are interested in determine why these optimization solvers consume more wall-clock time, 
whether it be mostly due to additional workload associated with optimization-based techniques or due 
to the presence of relatively more complicated and expensive data structures compared to the standard 
solvers used for the Galerkin formulation. The first step is noting the total KSP and TAG iterations 
needed and how they vary with respect to problem size. Figure 5 depicts the KSP and TAO iterations 
for the Galerkin and BLMVM methods respectively. It is well-known that block Jacobi (also known 
as ILU(O)) requires more iterates as the size of the problem increases. In other words, the solver may 
exhibit poor scaling for extremely large problems, but we see that the BLMVM algorithm has an even 
poorer scaling rate of the solver iterates. For the TRON solvers, we document both the KSP and TAO 
iterates as shown in Figure 6. We see that tightening the KSP tolerance increases the number of KSP 
iterates but requires slightly fewer TAO iterates. This behavior indicates that the more accurate the 
computed gradient projection is, the fewer optimization loops the solver has to perform. 
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Figure 8. Cube with a hole: arithmetic intensity for all solvers and all cases on a single 
processor. 


We also examine the measured floating-point rate provided by the PETSc performance logs, as 
shown in Figure 7, of all five solvers across their respective machines, and the floating point performance 
decreases as the problem size grows. One could compare these numbers to the TPP and see that the 
hardware efficiencies are no greater than 5%, but it is difficult to draw any other conclusions with regard 
to the computational performance. The calculated AI, based on our proposed performance model, is 
shown in Figure 8. It is interesting to note that the AI remains largely invariant with problem size 
unlike the wall-clock time, solver iterations, and floating point rates. According to the perfect cache 
model, the Galerkin formulation’s AI is greater than any of the non-negative methodologies. The 
optimization-based algorithm based on TAO’s BLMVM solver has significantly more streaming/vector 
operations which explains the relatively lower AI. Using these metrics in equation (4.2) as well as the 
STREAMS Triad bandwidth of one core as shown in Figure 2, the estimated roofline-based efficiencies 
are shown in Figure 9. Although the raw floating-point rate of BLMVM is lower than the Galerkin 
method, the roofline model suggests that BLMVM is actually more efficient in the hardware sense. The 
TRON methods have much lower floating-point rates, but these metrics can be improved or “gamed” by 
tightening the KSP tolerances. This behavior leads us to believe that there is some latency associated 
with setting up the data structures needed to compute gradient descent projections. 

5.1.2. Strong-scaling. The metric of most interest to many computational scientists is the strong¬ 
scaling potential of any numerical framework. We conduct strong-scaling studies to measure the speedup 
of all nine case studies over 64 cores. Four Mustang nodes with 16 cores each and 8 Wolf nodes with 
8 cores each are allocated for this study. We do not fully saturate the compute nodes because the 
STREAMS benchmark indicates that there is little or no gain in memory performance when using a full 
node. Figure 10 depicts the speedup on the Mustang system, and Figure 11 depicts the speedup on the 
Wolf system. First, we note that the parallel efficiency (actual speedup over ideal speedup) increases 
with problem size due to Amdahl’s Law. We also note that Wolf exhibits better strong-scaling due to 
the faster speedups for the same test studies. For all problems and machines, the TRON simulations 
are slightly less efficient in the parallel sense but can be improved by tightening the KSP tolerances. 
Interestingly, the BLMVM algorithm not only has the best roofline efficiency but also the best parallel 
speedup. We can infer from these results that although BLMVM is the more efficient optimization in 
the hardware sense, TRON is more efficient in the algorithmic sense due to its lesser time-to-solution. 
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(a) Mustang 



(b) Wolf 


Figure 9. Cube with a hole: estimated floating-point efficiency with respect to the 
arithmetic intensity and measured memory bandwidth from STREAMS. 


Our study has shown that one can draw correlations between the performance models conducted on a 
single-core and the actual speedup across multiple distributed memory nodes. As future solvers and 
algorithms are implemented within PETSc, we can use this performance model to assess how efficient 
they are in both the hardware and algorithmic sense and how efficiently they will scale in a parallel 
setting. 

5.2. Transport of chromium in subsurface. Subsurface clean-up due to anthropogenic con¬ 
tamination is a big challenge [7]. Remediation studies [10,11] need accurate predictions of transport of 
the involved chemical species, which are obtained using limited data at monitoring wells and through 
numerical simulations. To accurately predict the fate of the contaminant, a transport solver that: a) is 
robust, in that it will not give unphysical solutions, and b) can handle field-scale scenarios, is needed. 
The computational framework that is proposed in this paper is an ideal candidate for such problems. 
We now consider a realistic large-scale problem to predict the fate of chromium in the Los Alamos, 
New Mexico area. The chromium was released into the Sandia canyon in the 50s up to early 70s. Back 
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Figure 10. Cube with a hole: speedup for all 9 mesh cases up to 64 processors on the 
Mustang system (16 cores per node). 


then chromium was used as an anti-corrosion agent for the cooling towers at a power plant at the Los 
Alamos National Laboratory (see [12] and references therein for details). 

Here we study the effectiveness of our proposed framework to this real-world scenario of predicting 
the extent of chromium plume. The following is a conceptual model domain that is considered: A 
domain of size [496,503]km x [536,542]km x [0, lOOjm with the permeability field (m^) as shown in 
Figure 12. R42 in Figure 12 is estimated to be the contaminant source location and a pumping well 
is located at R28. The parameters used for this problem are shown in Table 7, and we employ the 
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Figure 11. Cube with a hole: speedup for all 9 mesh cases up to 64 cores on the Wolf 
system (8 cores per node). 

following boundary conditions: 

c^{x = 496km, y, z) = c^{x = 503km, y, z) = c^(x, y = 536km, z) = c^{x, y = 542km, z) = 0 (5.3) 

For this highly heterogeneous problem, we employ PETSc’s algebraic multi-grid preconditioner (GAMG) 
and couple this with the TRON algorithm for the non-negative solver. Our goal is to examine its strong¬ 
scaling potential across 1024 cores. We first solve the steady flow equation (based on mass balance and 
Darcy’s model to relate pressure and mass flux) with the pumping well located at R28. Cell-wise velocity 
is obtained from the resulting pressure field and used to calculated element-wise dispersion tensor. We 
then solve the transient diffusion problem (with tensorial dispersion) with a constant contaminant source 
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Figure 12. Chromium plume migration in the subsurface: Permeability field (m^) and 
the locations of the pumping well (R28) and contaminant source (R42). 


located at R40 for up to 180 days. The concentrations at select time levels for Galerkin formulation 
and non-negative methodology are shown in Figures 13 and 14, respectively. Negative concentrations 
arise with the Galerkin formulation even as the solution approaches steady-state. 

Figure 15 depicts the amount of wall-clock time with respect to the number of cores at the first 
time level. We see here that the demonstrates good parallel performance across 1024 cores with up 
to 35 percent parallel efficiency. Unlike the previous benchmark problem, we consider a case where 
we completely saturate a Wolf node by using all 16 cores and notice that the performance is slightly 
worse than using a partially saturated node (8 cores). This change of behavior can be attributed to 
the lack of memory performance improvement one achieves when using all 16 cores as shown in Figure 
2. Interprocess communication becomes a major component of the Wolf simulations so the parallel 
scalability gets worse the more efficiently TRON conducts its work. Nonetheless, the higher quality 
computing resources of a Wolf node results in faster solve times than the solve time on Mustang even 
with lesser parallel efficiency. 

Table 7. Chromium plume migration in the subsurface: parameters 


Parameter Value 

100 m 


ai 

OiT 

Contaminant source (R42) 
At 

Domain size 
Dm 

Permeability 
Pumping well (R28) 

Total hexahedrons 
Total vertices 

V 

Viscosity 


0.1 m 

1 X 10“^ kg/m^s^ 

0.2 days 

7000 kmx6000 kmxlOO m 

1 X 10“® m^/s 

Varies 

-0.01 kg/m^s^ 

1,984,512 

2,487,765 

Varies with position 
3.95x10-5 Pa s 
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Figure 13. Chromium plume migration i 
times using the Galerkin formulation. 
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the subsurface: concentrations at select 
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Figure 14. Chromium plume migration in the subsurface: concentrations at select 
times using the non-negative methodology. 
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Figure 15. Chromium plume migration in the subsurface: wall-clock time of the TRON 
optimization solver with multi-grid preconditioner (GAMG) versus number of processors 
after the first time level. Two Wolf cases are considered, where we fully saturate a 
compute node (16 cores) and where we partially saturate a compute node (8 cores per 
node). The parallel efficiency with respect to 16 cores is shown on the right hand side. 

Another metric of interest is the number of solver iterations required for convergence across various 
number of MPI processes. Figure 16 depicts the number of KSP solver iterations and TAG solver 
iterations across 1024 cores, and we notice that there are significant fluctuations. This trend is largely 
attributed to the accumulation of numerical round-offs from the TRON algorithm. One can reduce 
these fluctuations by tightening the solver tolerances, but the strong-scaling remains largely unaffected 
even for the results shown. This study suggests that the proposed non-negative methodology using 
TRON with GAMG preconditioning is suitable for large-scale transient heterogeneious and anisotropic 
diffusion equations. 


6. CONCLUDING REMARKS 

We presented a parallel non-negative computational framework suitable for solving large-scale 
steady-state and transient anisotropic diffusion equations. The main contribution is that the proposed 
parallel computational framework satisfies the discrete maximum principles for large-scale diffusion-type 
equations even on general computational grids. The parallel framework is built upon PETSc’s DMPlex 
data structure, which can handle unstructured meshes, and TAO for solving the resulting optimization 
problems from the discretization formulation. We have conducted systematic performance modeling 
and strong-scaling studies to demonstrate the efficiency, both in the parallel and hardware sense of the 
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Figure 16. Chromium plume migration in the subsurface: number of KSP (left hand 
side) and TAO (right hand side) solver iterations for the TRON optimization solver 
versus number of cores after the first time level. 


computational framework. The robustness of the proposed framework has been illustrated by solving a 
large-scale realistic problem involving the transport of chromium in the subsurface at Los Alamos, New 
Mexico. Future areas of research include: (a) extending the proposed parallel framework to advective- 
diffusive and advective-diffusive-reactive systems, and (b) posing the discrete problem as a variational 
inequality, which will be valid even for non-self-adjoint operators, and use other PETSc capabilities to 
solve such variational inequalities. 
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